Performance Analysis of Spectral Clustering on Compressed, Incomplete and Inaccurate Measurements

Blake Hunter and Thomas Strohmer 

Abstract 

Spectral clustering is one of the most widely used techniques for extracting the underlying global structure of a data 
set. Compressed sensing and matrix completion have emerged as prevailing methods for efficiently recovering sparse and 
partially observed signals respectively. We combine the distance preserving measurements of compressed sensing and 
matrix completion with the power of robust spectral clustering. Our analysis provides rigorous bounds on how small errors 
in the affinity matrix can affect the spectral coordinates and clusterability. This work generalizes the current perturbation 
results of two-class spectral clustering to incorporate multi-class clustering with k eigenvectors. We thoroughly track how 
small perturbations from using compressed sensing and matrix completion affect the affinity matrix and, in succession, the 
spectral coordinates. These perturbation results for multi-class clustering require an eigengap between the kth and (k+1)th 
eigenvalues of the affinity matrix, which naturally occurs in data with k well-defined clusters. Our theoretical guarantees are 
complemented with numerical results along with a number of examples of the unsupervised organization and clustering of 
image data. 
1 Introduction 

Data mining has become one of the fastest growing research topics in mathematics and computer science. 
Spectral clustering is a tool for extracting meaningful information from data by grouping similar objects 
together [1]. The method uses the eigenvectors of an adjacency matrix to embed the data into a space that 
captures the underlying group structure [2]. High-dimensional signals, magnetic resonance images, and 
hyperspectral images can be costly to acquire; even simple direct comparisons could be infeasible among such data 
sets. Our work shows that the meaningful organization extracted by spectral clustering is preserved under the 
perturbation from making compressed, incomplete and inaccurate measurements. Using bounds on the perturbation 
of eigenvectors, we establish error bounds on the spectral embedding when matrix completion and compressed 
sensing measurements are used. Given some error $N\epsilon$ in the entries of an affinity matrix $A \in \mathbb{R}^{N \times N}$, we show that 
the space spanned by the first $k$ perturbed eigenvectors is within $O(N\epsilon)$ of the span of the unperturbed eigenvectors. We 
prove that the perturbed spectral coordinates are within $O(N\epsilon)$ of a unitary transform of the unperturbed coordinates 
and can give k-means cluster assignments within $O(N\epsilon)$ of the unperturbed case. This analysis holds true when 
the error perturbation in the entries of an affinity matrix, $|\tilde{A}(i,j) - A(i,j)| \leq \epsilon$, is caused by making compressed 
sensing measurements, matrix completion or any other process, making our perturbed clustering results widely 
applicable. This work shows that spectral clustering is achievable in the compressed domain and with missing and 
noisy entries as long as the spectral gap condition is satisfied. 

As the dimensionality of data increases, data mining tasks such as clustering and classification can become 
intractable or costly to obtain. Traditional clustering algorithms must perform dimensionality reduction to make 
the problem tractable before they can be applied. Learning in the compressed domain was first proved possible 
using support vector machines in [3]. In addition to techniques for exact recovery of sparse signals, compressed 
sensing provides a bound on the error derived from making random measurements [4], [5]. We show how errors 
from using compressed sensing can affect the affinity matrix and in turn the spectral coordinates. 

In practice, data may be missing, lost or not fully observed. There are numerous tasks where only a small portion 
of the data is given, in the hope of understanding the entire set. A set of high dimensional images containing k 
hidden subsets with missing entries is impossible to cluster with standard clustering methods alone due to the lack 
of information from the incomplete data. Suppose the set of images is stacked as rows of a matrix. If this matrix 
is low rank, then the set of similar images contains a wealth of information about the missing entries of any one of the 
images. Matrix completion uses the information from similar rows of the data matrix to fill in the missing entries, 
making clustering possible. Matrix completion is an emerging area of research that provides efficient algorithms to 
reconstruct the full matrix X from a small subset of observed entries via nuclear-norm minimization [6], [7]. Since 
data matrices are usually not exactly low rank, the matrix completion procedure results in errors in the recovered 
entries. We analyze how these errors propagate through the spectral clustering steps and derive rigorous bounds 
under which clusterability of the data is preserved after matrix completion. An example is depicted in Fig. 1, 
where face images can be successfully clustered even if only 5% of the image data are available. For details we 
refer to Section 6. 





Fig. 1. Clustering a dataset of 100 images of three different people's faces in a range of poses from profile to frontal views where only 5% of each 
image is observed. Only two of the three people's faces can be published, so a happy face is used here in place of the third person's face for display 
purposes only. The images are clustered by applying the matrix completion perturbed spectral clustering coordinates (2nd and 3rd eigenvectors). 

The structure of the paper is as follows. Previous results on perturbation of eigenvectors and spectral clustering 
with perturbed data are presented in Section 2. Our robustness analysis is presented in Section 3. It is followed 
by a theoretical justification and a comparison to other methods. Small perturbation error in the affinity matrix 
is shown to be well behaved in the spectral coordinates in Section 4. Measurement errors due to using matrix 
completion and compressed sensing measurements are shown to give small error in the affinity matrix and in turn 
the spectral coordinates in Section 5, where bounds on the span of the first k eigenvectors under small perturbations 
are provided. Section 6 is dedicated to numerical results, where the method is applied to both synthetic and real 
world image data sets. 

2 Background 

2.1 Spectral Clustering 

Clustering is an unsupervised learning problem that reveals the underlying structure from unlabeled data. The 
goal of clustering is to partition objects into groups such that objects within the same group are similar. Standard 
clustering such as k-means requires the space in which the objects are represented, to be linearly separable. Spectral 
clustering methods detect non-convex patterns and linearly non-separable clusters. This allows for a wider range 
of underlying geometries, making them more flexible [T), ©• 

Standard spectral clustering uses the eigenvectors of the graph formed by local distances between data points, 
to reveal the global structure of the data set. A traditional choice of edge weights uses the Gaussian kernel, 

$$W(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2\sigma^2}\right). \tag{1}$$

A random walk on the graph is defined by normalizing the rows of $W$ to give the stochastic matrix 

$$P = D^{-1} W,$$

where 

$$D_{ii} = \sum_{k=1}^{N} W(x_i, x_k)$$

defines the diagonal matrix $D$ of row sums of $W$. 

The original symmetry of $W$, lost in $P = D^{-1} W$, can be preserved by defining $A$ as 

$$A = D^{-1/2} W D^{-1/2}. \tag{2}$$

Spectral clustering finds the top $k$ eigenvectors $V_k \in \mathbb{R}^{N \times k}$ of $A$ to provide coordinates for clustering. To cluster the 
original data, a standard clustering algorithm like k-means is then applied to the rows of $V_k$, as illustrated in [1]. 
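
The construction in (1) and (2) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation used for the experiments in this paper: the function name `spectral_embedding`, the bandwidth `sigma = 1.0`, and the two synthetic Gaussian blobs are all assumptions made for the example.

```python
import numpy as np

def spectral_embedding(X, k, sigma=1.0):
    """Return the top-k eigenvalues and eigenvectors of A = D^{-1/2} W D^{-1/2}.

    The rows of the returned N x k matrix are the spectral coordinates.
    The name and the default sigma are illustrative assumptions.
    """
    # Squared Euclidean distances between all pairs of rows of X.
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma**2))      # Gaussian kernel weights, as in (1)
    d = W.sum(axis=1)                       # row sums, the diagonal of D
    A = W / np.sqrt(np.outer(d, d))         # symmetric normalization, as in (2)
    vals, vecs = np.linalg.eigh(A)          # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters in the plane (toy data, for illustration only).
X = np.vstack([rng.normal(0.0, 0.1, (15, 2)), rng.normal(5.0, 0.1, (15, 2))])
vals, V = spectral_embedding(X, k=2)
```

Any standard k-means routine can then be run on the rows of `V`. In this toy example rows belonging to the same cluster point in nearly the same direction, while rows from different clusters are nearly orthogonal.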

To explain why spectral clustering works, Shi and Malik proved in [2] that the second eigenvector 
of P is the real-valued solution to the normalized cut minimization problem, which bipartitions the points of the graph. 
The graph bipartitioning problem was extended to multi-class clustering by using multiple eigenvectors as described 
in m. GDI- 



2.2 Perturbation of the second eigenvector 

Earlier results have shown that spectral clustering using the second eigenvector is robust to small perturbation of 
the data, see [11], [12]. These results are based on the following perturbation theorem by Stewart [13]. 

Theorem 1: Let $\tilde{A} = A + E$ be a perturbation of $A$, let $\lambda_i$ and $v_i$ be the $i$th eigenvalue and eigenvector of $A$, 
and let $\tilde{v}_i$ be the $i$th eigenvector of $\tilde{A}$. Then 

$$\|\tilde{v}_2 - v_2\| \leq \frac{1}{\lambda_2 - \lambda_3} \|E\| + O(\|E\|^2). \tag{3}$$

This holds provided the gap between the second and third eigenvalue is not close to zero. This is not the case of 
data sets with more than two underlying clusters. The number of eigenvalues close to one is equal to the number of 
separate clusters. Consider the simplest example, where there are three single points forming three non-connected 
clusters. 

The affinity matrix $A$ will be the $3 \times 3$ identity and will have three eigenvalues equal to one, hence the gap between 
the second and third eigenvalue is zero. In general, for data sets with $k$ well separated clusters there will be 
$k$ eigenvalues close to 1, making the eigengap $\lambda_2 - \lambda_3$ close to zero and destroying the bound on the second 
eigenvector in Theorem 1. 

Small perturbations in the entries of an affinity matrix can lead to large perturbations in the eigenvectors. First 
consider the ideal clustering data set with two underlying clusters, where each data point has equal similarity to 
each intraclass point and is dissimilar to each interclass point. Ordering the points by cluster partitions the affinity matrix 
$A$ into a block diagonal matrix, 

$$A = \begin{pmatrix} A_1 & 0 \\ 0 & A_2 \end{pmatrix},$$

where $A_i$ is a matrix of all ones and $0$ is the zero matrix. Let 

$$\tilde{A} = A + E$$

be a perturbation of that matrix, where $E$ has entries uniformly distributed from $0$ to $\epsilon$. Previous analysis used in 
[11] says that when $\epsilon$ is small, the second eigenvectors of $A$ and $\tilde{A}$ are close, i.e. satisfy Theorem 1. The second 
eigenvector is a positive/negative indicator of each point's cluster membership. The vector is constant for all the 
points within each cluster, but there is an arbitrary choice of which sign is assigned to each cluster, as can be seen 
in Figure 2. With the correct choice made, the bound in Theorem 1 holds. The Euclidean distance $\|\tilde{v}_2 - v_2\|$ 
can be large when the sign is chosen incorrectly, but what is preserved is the space spanned by $\{v_2\}$ and $\{\tilde{v}_2\}$. We 
define closeness of these subspaces using canonical angles. 



Definition 2: Let $V_k$ and $\tilde{V}_k$ be the subspaces spanned by the orthonormal eigenvectors $v_i, \ldots, v_{i+k}$ and $\tilde{v}_i, \ldots, \tilde{v}_{i+k}$, 
and let $\gamma_1 \leq \ldots \leq \gamma_k$ be the singular values of $[v_i \cdots v_{i+k}]^T [\tilde{v}_i \cdots \tilde{v}_{i+k}]$. Then the values 

$$\theta_i = \cos^{-1} \gamma_i$$

are called the canonical angles between $V_k$ and $\tilde{V}_k$. 

Define $V_k$ and $\tilde{V}_k$ to be close if the largest canonical angle, $\theta_1$, is small. See [14], [15] for more details. 
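
A small numerical sketch illustrates why canonical angles are the right notion of closeness here. The helper `canonical_angles` is a hypothetical name introduced for this example, and the four-point indicator vector is made-up toy data. Flipping the sign of an eigenvector changes the Euclidean distance drastically but leaves the canonical angle at zero.

```python
import numpy as np

def canonical_angles(U, V):
    """Canonical angles between span(U) and span(V).

    U and V are assumed to have orthonormal columns; the angles are the
    arccosines of the singular values of U^T V.
    """
    gamma = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(gamma, -1.0, 1.0))

# A unit-norm +/- indicator of two clusters and the same vector with flipped sign.
v = np.array([[1.0], [1.0], [-1.0], [-1.0]]) / 2.0
v_flipped = -v

dist = np.linalg.norm(v - v_flipped)      # Euclidean distance between the vectors
theta = canonical_angles(v, v_flipped)    # angle between the spanned subspaces
```

The two vectors are as far apart as unit vectors can be, yet they span the same one-dimensional subspace and therefore induce the same partition of the points.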
Fig. 2. 30 data points with two underlying clusters. Top: (from right to left) The affinity matrix $A$, the first ten eigenvalues, the first eigenvector $v_1$ 
and the second eigenvector $v_2$. Bottom: (from right to left) The affinity matrix $\tilde{A}$, the first ten eigenvalues, the first eigenvector $\tilde{v}_1$ and the second 
eigenvector $\tilde{v}_2$. The canonical angle between $v_2$ and $\tilde{v}_2$ is $\theta_1 = .0199$. 

Now consider a data set with three underlying clusters, so 

$$A = \begin{pmatrix} A_1 & & \\ & A_2 & \\ & & A_3 \end{pmatrix}.$$

Here the eigengap between the second and third eigenvalue is small, which destroys the bound in Theorem 1. 
Previous work [12] argues that even though this bound fails, in practice the second eigenvector can still give the 
correct coordinates for clustering, but provides no justification. Figure 3 shows that the clusterability of 
the perturbed spectral coordinates is not captured by the difference between eigenvectors, $\|\tilde{v}_2 - v_2\|$, but by the 
canonical angle between the subspaces spanned by the eigenvectors. Even though the Euclidean distance $\|\tilde{v}_2 - v_2\|$ 
is large, the clusterability of $v_2$ is maintained in $\tilde{v}_2$. This robustness of the clusterability can be characterized by the 
small canonical angle between $\{v_2, v_3\}$ and $\{\tilde{v}_2, \tilde{v}_3\}$. 

3 Spectral Clustering on Perturbed Data 

Standard spectral clustering methods use the first $k$ eigenvectors of $A$, constructed from local distances between 
points, to provide a $k$-dimensional representation of the data, which is used as coordinates for clustering. 
Matrix completion and compressed sensing measurements are guaranteed to give a good approximation of these 
Euclidean distances even when only a small fraction of the entries are observed or the number of measurements 
is much less than the ambient dimension. Our method merges the distance preserving dimensionality reduction 
of compressed sensing and matrix completion with the power of spectral clustering. Previous results on spectral 
clustering on perturbed data are based on perturbation bounds when using the second eigenvector for bipartitioning. 
Our analysis generalizes these results to incorporate multi-class clustering using the top $k$ eigenvectors. 

Assume that the local distances required for standard spectral clustering are replaced by perturbed distances. 
Define the local distance using perturbed data $\tilde{X}$ as 

$$d(x_i, x_j) = \|\tilde{x}_i - \tilde{x}_j\|_2. \tag{4}$$
Fig. 3. 30 data points with three underlying clusters. Top: (from right to left) The affinity matrix $A$, the first ten eigenvalues, the second eigenvector 
$v_2$ and the third eigenvector $v_3$. Bottom: (from right to left) The affinity matrix $\tilde{A}$, the first ten eigenvalues, the second eigenvector $\tilde{v}_2$ and the third 
eigenvector $\tilde{v}_3$. The normed difference $\|\tilde{v}_2 - v_2\| = 1.7738$ is large, whereas the largest canonical angle between the column spaces of $\{v_2, v_3\}$ 
and $\{\tilde{v}_2, \tilde{v}_3\}$, $\theta_1 = .0548$, is still small. 



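Construct a graph with edge weights 

$$\tilde{W}(x_i, x_j) = \exp\left(-\frac{d(x_i, x_j)^2}{2\sigma^2}\right). \tag{5}$$

Define the symmetric $N \times N$ matrix 

$$\tilde{A} = \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}, \tag{6}$$

where $\tilde{D}_{ii} = \sum_{k=1}^{N} \tilde{W}(x_i, x_k)$. 

The first $k$ eigenvectors $\tilde{V}_k \in \mathbb{R}^{N \times k}$ of $\tilde{A}$ are used as a $k$-dimensional representation of the data. With these 
spectral coordinates preserved, k-means is applied to the rows of $\tilde{V}_k$ to cluster the original data points $x_i$. 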

For classification with partially labeled data, the membership of an object is matched to that of its neighbors 
by performing k-means in the eigenvector domain. We quantify the error in misclassified data by defining the 
misclassification rate as, 

$$\frac{1}{N} \sum_{i=1}^{N} \chi(I_i \neq \tilde{I}_i),$$

where $\chi$ is the indicator function, $I_i$ is the value indicating the class membership of $x_i$, and $\tilde{I}_i$ that of $\tilde{x}_i$. 

Often when analyzing high dimensional signals, the underlying structure of interest has only a few degrees of 
freedom or is sparse in some unknown basis. We analyze two types of perturbed data $\tilde{X}$ in Section 5. We show that 
instead of requiring the local distances to be computed in the large ambient dimension, measurements can be made on 
the order of the dimension of the hidden underlying point cloud structure. Using the controllable error from taking 
compressed sensing measurements ($d(x_i, x_j) = \|\Phi x_i - \Phi x_j\|_2$) and matrix completion ($d(x_i, x_j) = \|\hat{x}_i - \hat{x}_j\|_2$), we 
establish perturbation bounds on the affinity matrix, the eigenvectors, the spectral coordinates and the clustering 
memberships. 

Assume that there is an underlying $s$-sparse representation $y_i$ of the data $x_i$, where $y_i = B x_i$ is a known or 
unknown unitary transformation of $x_i$. Let $\Phi$ be a random $m \times n$ matrix with Gaussian $N(0, 1)$ entries. Define the 
local distance in (4) using $m$ compressed sensing measurements as 

$$\tilde{x}_i = \Phi x_i. \tag{7}$$

Local distances are preserved under the perturbation of making compressed sensing measurements when the 
number of measurements $m$ is large enough. 
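
The distance preservation claimed above can be checked empirically. The sketch below uses synthetic $s$-sparse signals and a Gaussian measurement matrix; the sizes `n`, `m`, `s`, `N` are arbitrary choices for illustration, and the $1/\sqrt{m}$ scaling of the matrix is an assumption added so that squared norms are preserved in expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, s, N = 1000, 200, 5, 20   # ambient dim, measurements, sparsity, number of signals

# N synthetic s-sparse signals in R^n (illustrative data only).
X = np.zeros((N, n))
for i in range(N):
    support = rng.choice(n, size=s, replace=False)
    X[i, support] = rng.normal(size=s)

# Gaussian measurement matrix; the 1/sqrt(m) scaling is an assumption of this sketch.
Phi = rng.normal(size=(m, n)) / np.sqrt(m)
Z = X @ Phi.T                   # row i holds the compressed measurements Phi x_i

def pairwise_dist(Y):
    """Matrix of Euclidean distances between the rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.linalg.norm(diff, axis=2)

D_full = pairwise_dist(X)
D_comp = pairwise_dist(Z)
iu = np.triu_indices(N, k=1)    # each unordered pair i < j once
ratios = D_comp[iu] / D_full[iu]
```

The entries of `ratios` concentrate around 1, and the concentration tightens as `m` grows, which is the behavior the affinity matrix bounds in Section 5 rely on.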
Fig. 4. The handwritten digits {1,3} data set is projected onto the 2nd and 3rd eigenvectors. Left: From the graph formed using Euclidean 
distances between points. Center: Using distances from 30 random Gaussian measurements. Right: 128 random Gaussian measurements. 



In many applications such as collaborative filtering, computer vision and wireless sensor networks, the data being 
analyzed may be lost, damaged or only partially observed. Matrix completion reconstructs a low-rank data matrix 
from a small subset of its entries. 

Now assume $X$ is a low rank matrix. Under some constraints on the matrix, known as the strong incoherence 
property, $X$ can be reconstructed from a fraction of the entries $P_\Omega(X)$. Define the local distance in (4) using the 
reconstructed matrix $\hat{X}$, where 

$$\tilde{x}_i = \text{row } i \text{ of } \hat{X}.$$

Local distances are preserved under the perturbation of matrix completion when $X$ obeys the strong incoherence 
property. 

Our analysis provides rigorous bounds on how small errors in the affinity matrix can affect the spectral coordinates 
and clusterability. Our analysis not only applies to compressive spectral clustering but generalizes the current 
results of spectral clustering on perturbed data to incorporate multi-class clustering with $k$ eigenvectors. We show that 
perturbations due to compressed measurements and matrix completion preserve the affinity matrices, i.e. for any 
$0 < \epsilon < 1$, given that the number of measurements or observed entries is large enough, $|\tilde{A}_{ij} - A_{ij}| \leq \epsilon$. With this we 
show that the span of the first $k$ eigenvectors of $\tilde{A}$ is close to the span of the first $k$ eigenvectors of $A$, $\|\sin \Theta\|_F \leq \frac{N\epsilon}{\alpha}$. We then show that, 
given $\|A - \tilde{A}\|_F \leq N\epsilon$, the perturbed spectral coordinates are within $O(N\epsilon)$ of a unitary transform 
$Q$ of the unperturbed coordinates, $\|\tilde{v}(i) - v(i)Q\|_2 \leq (1 + \sqrt{2})\frac{N\epsilon}{\alpha}$. When spectral clustering is performed in the 
compressed domain or after applying matrix completion, the eigenvectors of $\tilde{A}$ or $\hat{A}$ can replace the eigenvectors 
of A as coordinates for clustering and classification as seen in Figure S] and Q] 

4 Robustness of clustering under perturbation of the top k eigenvectors 

Previous results in approximate spectral clustering based on the perturbation of the second eigenvector [12] are 
limited to bipartitioning and require assumptions on the distributions of the perturbation of the components of $v_2$. 
We expand the theory of approximate spectral clustering to partitioning data with $k$ underlying clusters by showing 
that what is preserved under small perturbations is the column space of the first $k$ eigenvectors of $A$. When the 
eigengap between $\lambda_1, \lambda_2, \ldots, \lambda_k$ is small, the column space spanned by their corresponding eigenvectors will 
be close to the column space spanned by the eigenvectors of the perturbed matrix. 
Let 

$$V_k = \begin{pmatrix} v_1 & v_2 & \cdots & v_k \end{pmatrix},$$

where $v_l$ is the column eigenvector corresponding to the $l$th largest eigenvalue of $A$. Similarly define 

$$\tilde{V}_k = \begin{pmatrix} \tilde{v}_1 & \tilde{v}_2 & \cdots & \tilde{v}_k \end{pmatrix}$$

for $\tilde{v}_l$ the $l$th eigenvector of $\tilde{A}$. 
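Given there is an eigengap, $(\lambda_k - \lambda_{k+1})$, between the first $k$ eigenvalues $\Sigma_k = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_k)$ and the last 
$N - k$ eigenvalues, $\Sigma_{N-k} = \mathrm{diag}(\lambda_{k+1}, \ldots, \lambda_N)$ of $A$, we write the block decomposition of the eigenvectors as 
$V = [V_k, V_{N-k}]$ where $V_{N-k} = [v_{k+1}, \ldots, v_N]$. So the eigendecomposition can be written as 

$$(V_k \; V_{N-k})^T A \, (V_k \; V_{N-k}) = \begin{pmatrix} \Sigma_k & 0 \\ 0 & \Sigma_{N-k} \end{pmatrix},$$

and similarly for $\tilde{A}$, 

$$(\tilde{V}_k \; \tilde{V}_{N-k})^T \tilde{A} \, (\tilde{V}_k \; \tilde{V}_{N-k}) = \begin{pmatrix} \tilde{\Sigma}_k & 0 \\ 0 & \tilde{\Sigma}_{N-k} \end{pmatrix}.$$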



Theorem 3: Let $\lambda_i$, $v_i$ and $\tilde{\lambda}_i$, $\tilde{v}_i$ be the $i$th eigenvalues and eigenvectors of $A$ and $\tilde{A}$ respectively, and let $\Theta = 
\mathrm{diag}(\theta_1, \theta_2, \ldots, \theta_k)$ be the diagonal matrix of canonical angles between the column spaces of $V_k = [v_1, v_2, \ldots, v_k]$ 
and $\tilde{V}_k = [\tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_k]$. If there is a gap $\alpha > 0$ such that 

$$|\tilde{\lambda}_k - \lambda_{k+1}| \geq \alpha$$

and 

$$\tilde{\lambda}_k \geq \alpha,$$

then 

$$\|\sin \Theta\|_F \leq \frac{1}{\alpha} \|A \tilde{V}_k - \tilde{V}_k \tilde{\Sigma}_k\|_F,$$

where $\sin \Theta$ is taken entrywise. 

This is a reformulation of the $\sin \Theta$ theorem of Davis and Kahan [14]. 

Note that $\lambda_k$ and $\tilde{\lambda}_k$ are close by the Mirsky theorem, so the first condition requires $\alpha$ to be less than the eigengap 
between $\lambda_k$ and $\lambda_{k+1}$. The second condition requires $\alpha$ to be less than the eigenvalues in the first block. This 
extends the perturbation results for the second eigenvector in Theorem 1 to bounding the perturbation of the 
column space of the first $k$ eigenvectors. 

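Corollary 4: Let $P_{V_k}$ and $P_{\tilde{V}_k}$ be the orthogonal projections onto $V_k$ and $\tilde{V}_k$. If there is an $\alpha > 0$ such that $\lambda_k - \lambda_{k+1} \geq \alpha$ 
and $\lambda_k \geq \alpha$, then 

$$\|P_{V_k} - P_{\tilde{V}_k}\|_F \leq \frac{\sqrt{2}}{\alpha} \|A - \tilde{A}\|_F. \tag{8}$$

Proof: Assume there exists an $\alpha > 0$ such that $\lambda_k - \lambda_{k+1} \geq \alpha$ and $\lambda_k \geq \alpha$. With the eigengap $\lambda_k - \lambda_{k+1} \geq \alpha$ 
and the block decomposition of $V$ and $\tilde{V}$, Theorem 3 bounds the canonical angles $\theta_i$ between the column spaces of 
$\{v_1, v_2, \ldots, v_k\}$ and $\{\tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_k\}$. Writing $E = \tilde{A} - A$ and using $\tilde{A} \tilde{V}_k = \tilde{V}_k \tilde{\Sigma}_k$, 

$$\|\sin \Theta\|_F \leq \frac{1}{\alpha} \|(\tilde{A} - E)\tilde{V}_k - \tilde{V}_k \tilde{\Sigma}_k\|_F \leq \frac{1}{\alpha} \left( \|\tilde{A} \tilde{V}_k - \tilde{V}_k \tilde{\Sigma}_k\|_F + \|E \tilde{V}_k\|_F \right) \leq \frac{1}{\alpha} \|A - \tilde{A}\|_F.$$

It is shown in [15] that the norm from canonical angles and the norm from projections satisfy 

$$\|P_{V_k} - P_{\tilde{V}_k}\|_F = \sqrt{2}\, \|\sin \Theta\|_F.$$

Combining these gives the result. □ 
This establishes closeness between the space spanned by the first $k$ eigenvectors of $A$ and the first $k$ eigenvectors 
of $\tilde{A}$, bounding the difference between the low dimensional embeddings obtained by projecting onto the first $k$ eigenvectors 
of $A$ and $\tilde{A}$. Euclidean geometry is essentially preserved if the eigengap condition is satisfied. This is shown in Theorem 5 
and Corollary 6. 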
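
The projection bound (8) can be checked numerically on a small synthetic example. The sketch below is not taken from the experiments of this paper: the matrix sizes, the chosen eigenvalues, and the perturbation level are arbitrary assumptions, and $\alpha$ is taken as the gap between the $k$th perturbed eigenvalue and the $(k+1)$th unperturbed eigenvalue, following the condition of the $\sin \Theta$ theorem.

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 30, 3

# Synthetic symmetric matrix with k eigenvalues near 1 and the rest at most 0.3.
Qmat, _ = np.linalg.qr(rng.normal(size=(N, N)))
eigs = np.concatenate([[1.0, 0.95, 0.9], rng.uniform(0.0, 0.3, N - k)])
A = (Qmat * eigs) @ Qmat.T

# Symmetric perturbation with entries between 0 and 1e-3.
E = rng.uniform(0.0, 1e-3, size=(N, N))
E = (E + E.T) / 2.0
A_tilde = A + E

def top_k(M, k):
    """Eigenvalues in descending order and the eigenvectors of the k largest."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order[:k]]

lam, V = top_k(A, k)
lam_t, V_t = top_k(A_tilde, k)

alpha = lam_t[k - 1] - lam[k]            # gap used in the sin(Theta) theorem
P, P_t = V @ V.T, V_t @ V_t.T            # orthogonal projections onto the two spans
lhs = np.linalg.norm(P - P_t, "fro")
rhs = np.sqrt(2.0) / alpha * np.linalg.norm(E, "fro")
```

In this example `lhs` stays below `rhs`, and shrinking the gap between the third and fourth eigenvalues in `eigs` loosens the bound accordingly.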

Theorem 5: Let $V_k$ be the matrix formed by the top $k$ column eigenvectors of $A$ and $\tilde{V}_k$ the matrix formed by the 
top $k$ eigenvectors of $\tilde{A}$, as defined above. If $\lambda_k - \lambda_{k+1} \geq \alpha$ and $\lambda_k \geq \alpha$, then there is a unitary matrix $Q$ such that 

$$\|\tilde{V}_k - V_k Q\|_2 \leq (1 + \sqrt{2}) \frac{1}{\alpha} \|A - \tilde{A}\|_F.$$

Proof: 

To compare $\tilde{V}_k$ with $V_k$ we find a unitary matrix $Q$ such that $\|\tilde{V}_k - V_k Q\|_F$ is minimal. It is shown in [16] that 
this can be found by taking the singular value decomposition of $V_k^T \tilde{V}_k$, 

$$Y^T V_k^T \tilde{V}_k Z = \mathrm{diag}(\cos \theta_i),$$

where $\theta_i$ are the canonical angles between the column space of $V_k$ and the column space of $\tilde{V}_k$. Thus $Q = Y Z^T$ is 
the orthogonal matrix that minimizes $\|\tilde{V}_k - V_k Q\|_F$. 



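$$\begin{aligned}
\|\tilde{V}_k - V_k Q\|_2 &\leq \|\tilde{V}_k - V_k V_k^T \tilde{V}_k\|_2 + \|V_k V_k^T \tilde{V}_k - V_k Q\|_2 \\
&\leq \|(P_{\tilde{V}_k} - P_{V_k}) \tilde{V}_k\|_2 + \|V_k\|_2 \|V_k^T \tilde{V}_k - Q\|_2 \\
&\leq \|P_{V_k} - P_{\tilde{V}_k}\|_2 \|\tilde{V}_k\|_2 + \|V_k\|_2 \|V_k^T \tilde{V}_k - Q\|_2 \\
&= \|P_{V_k} - P_{\tilde{V}_k}\|_2 + \|V_k^T \tilde{V}_k - Y Z^T\|_2 \qquad (\|V_k\|_2 = 1 = \|\tilde{V}_k\|_2) \\
&= \|P_{V_k} - P_{\tilde{V}_k}\|_2 + \|Y (\cos \Theta) Z^T - Y Z^T\|_2 \\
&\leq \|P_{V_k} - P_{\tilde{V}_k}\|_2 + \|Y\|_2 \|\cos \Theta - I\|_2 \|Z^T\|_2 \\
&= \|P_{V_k} - P_{\tilde{V}_k}\|_2 + \|\cos \Theta - I\|_2 \qquad (Y \text{ and } Z \text{ are unitary}) \\
&\leq \|P_{V_k} - P_{\tilde{V}_k}\|_F + \|\cos \Theta - I\|_F.
\end{aligned}$$

Since the $\cos \theta_i$ are singular values, $0 \leq \cos \theta_i \leq 1$ for all $i$, which means 

$$\begin{aligned}
1 + 2\cos^2 \theta_i &\leq 1 + 2\cos \theta_i \\
1 - 2\cos \theta_i + \cos^2 \theta_i &\leq 1 - \cos^2 \theta_i \\
(\cos \theta_i - 1)^2 &\leq \sin^2 \theta_i.
\end{aligned} \tag{9}$$

So we have 

$$\|\cos \Theta - I\|_F = \sqrt{\sum_{i=1}^{k} (\cos \theta_i - 1)^2} \leq \sqrt{\sum_{i=1}^{k} \sin^2 \theta_i} = \|\sin \Theta\|_F \quad \text{by (9)}.$$

Applying (8) to 

$$\|\tilde{V}_k - V_k Q\|_2 \leq \|P_{V_k} - P_{\tilde{V}_k}\|_F + \|\sin \Theta\|_F$$

gives the result. □ 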
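
The aligning matrix $Q = Y Z^T$ from the proof can be computed directly from a singular value decomposition. The sketch below is illustrative only: the function name `align`, the random orthonormal basis, and the noise level are assumptions. It compares a basis with a rotated and slightly perturbed copy of itself, before and after alignment.

```python
import numpy as np

def align(V, V_tilde):
    """Orthogonal Q = Y Z^T from the SVD V^T V_tilde = Y diag(cos theta) Z^T.

    This Q minimizes ||V_tilde - V Q||_F over orthogonal matrices.
    """
    Y, cosines, Zt = np.linalg.svd(V.T @ V_tilde)
    return Y @ Zt, cosines

rng = np.random.default_rng(3)
N, k = 30, 3
V, _ = np.linalg.qr(rng.normal(size=(N, k)))     # orthonormal basis of a k-dim subspace
R, _ = np.linalg.qr(rng.normal(size=(k, k)))     # arbitrary rotation/reflection
# Rotated basis plus small noise, re-orthonormalized.
V_tilde, _ = np.linalg.qr(V @ R + 1e-3 * rng.normal(size=(N, k)))

Q, cosines = align(V, V_tilde)
err_raw = np.linalg.norm(V_tilde - V, 2)          # spectral norm without alignment
err_aligned = np.linalg.norm(V_tilde - V @ Q, 2)  # spectral norm after alignment
```

Without alignment the two bases look far apart even though they span almost the same subspace; after multiplying by `Q` the remaining difference is on the order of the added noise.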
Each data point can be clustered by applying a clustering algorithm, such as k-means or PCA, to the rows of 
$V_k$ or $\tilde{V}_k$, see [1], [17]. The rows of $V_k$ and $\tilde{V}_k$ both provide coordinates for clustering and classification. To analyze 
the clusterability we need to compare the rows of $V_k$ and $\tilde{V}_k$. Corollary 6 shows that the Euclidean geometry of the 
spectral coordinates is preserved under perturbation if the eigengap condition is satisfied. This generalizes the current 
perturbation results of bipartitioning data by thresholding the second eigenvector to incorporate k-way clustering 
with $k$ eigenvectors. 

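Corollary 6: Let $v(i)$ be the $i$th row of the matrix $V_k$ formed by the top $k$ column eigenvectors of $A$ and $\tilde{v}(i)$ be 
the $i$th row of $\tilde{V}_k$ formed by the top $k$ eigenvectors of $\tilde{A}$. If $\lambda_k - \lambda_{k+1} \geq \alpha$ and $\lambda_k \geq \alpha$, then 

$$\|\tilde{v}(i) - v(i) Q\|_2 \leq (1 + \sqrt{2}) \frac{1}{\alpha} \|A - \tilde{A}\|_F,$$

where $Q$ is the orthogonal matrix that minimizes $\|\tilde{V}_k - V_k Q\|_F$. 
Proof: 

$$\|\tilde{v}(i) - v(i) Q\|_2 \leq \|\tilde{V}_k - V_k Q\|_2 \leq (1 + \sqrt{2}) \frac{1}{\alpha} \|A - \tilde{A}\|_F.$$

□ 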

5 Perturbation from Compressible, Incomplete and Inaccurate Measurements 

The theory that we have developed shows how small perturbations in the affinity matrix affect the spectral 
coordinates. This analysis is based on having knowledge of the amount of error in $\|A - \tilde{A}\|_F$. Direct knowledge of 
the error in the affinity matrix arises naturally in many areas such as wireless sensor networks, where information 
can be lost or corrupted between independent spatially distributed sensors. More commonly in practice this error 
is not known directly; what is observed is the error in the data itself, $X - \tilde{X}$. We analyze how error arising 
from taking compressible, incomplete and inaccurate measurements affects the affinity matrix and the spectral 
coordinates. 
5.1 Perturbation of the entries of A from taking compressed sensing measurements 

Traditionally, spectral clustering methods use the local Euclidean distance $d(x_i, x_j) = \|x_i - x_j\|_2$ to create the affinity 
matrix $A$. We show that $\tilde{A}$ defined using compressed sensing measurements can be made arbitrarily close to $A$. We 
prove that $\tilde{V}_k$ can be made arbitrarily close to a unitary transform of $V_k$, and the standard spectral coordinates can 
be replaced by the compressed spectral coordinates to provide the same clustering assignments. 

5.2 Compressed Sensing Background 

Dimensionality reduction or low dimensional representation has become a central problem in signal and image 
processing. As the dimensionality of data increases, data mining tasks such as clustering and classification can 
become intractable or costly to obtain. Traditional clustering algorithms must perform dimensionality reduction to 
make the problem tractable before they can be applied. Usually some type of transform is required to be computed 
to produce a sparse vector to make clustering feasible. In [3] it was shown that learning using support vector 
machines can be done in the measurement domain without having to transform the data to a sparse representation. 
We provide details on how, and how well, spectral clustering works in the measurement domain through careful 
analysis. 

Compressed sensing provides techniques for exact recovery of sparse signals $x$ from random measurements 
$z = \Phi x$, where $\Phi$ is a random $m \times n$ matrix. In general this is an ill-posed problem, but the assumption of sparsity 
makes recovery possible [4], [5], [18]. A major result of compressed sensing proves that exact recovery of sparse 
signals can be guaranteed when the number of measurements is $m = O(s \log(n/s))$, which is much less than the ambient 
dimension $n$; see [5], [4] and the references therein. A central idea of compressed sensing is the restricted isometry 
property. 

Definition 7: The restricted isometry property (RIP) holds with parameters $(r, \delta)$, where $\delta \in (0, 1)$, if 

$$(1 - \delta)\|x\|_2 \leq \|\Phi x\|_2 \leq (1 + \delta)\|x\|_2 \tag{10}$$

holds for all $s$-sparse vectors $x$. 

It has been shown that random Gaussian, Bernoulli, and partial Fourier matrices satisfy the RIP with high 
probability [19]. 

There is a large body of work that uses random projections for dimensionality reduction. Most of these methods 
are shown to work well in practice but have no theoretical uniform guarantees. Compressed sensing provides 
numerical bounds on the error produced when taking random measurements with uniform guarantees. 

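Theorem 8: Let $W_{ij} = e^{-\|x_i - x_j\|_2^2 / (2\sigma^2)}$ and $A = D^{-1/2} W D^{-1/2}$, where $D_{ii} = \sum_{k=1}^{N} W(x_i, x_k)$. Let $\tilde{A} = \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}$, 
where $\tilde{D}_{ii} = \sum_{k=1}^{N} \tilde{W}(x_i, x_k)$. If $\tilde{W}_{ij} = e^{-\|\Phi x_i - \Phi x_j\|_2^2 / (2\sigma^2)}$, where the $x_i$ are $s$-sparse and $\Phi$ satisfies the RIP with 

$$\delta = \frac{\epsilon}{4 \max_{i,j} \left\{ \frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right\}},$$

then for $0 < \epsilon < 1$, 

$$|A_{ij} - \tilde{A}_{ij}| \leq \epsilon.$$

Proof: 

By the quadratic form of the restricted isometry property (RIP) from [19] we have 

$$(1 - \delta)\|x_i - x_j\|_2^2 \leq \|\Phi x_i - \Phi x_j\|_2^2 \leq (1 + \delta)\|x_i - x_j\|_2^2. \tag{11}$$

Multiplying by $-\frac{1}{2\sigma^2}$ and exponentiating gives 

$$e^{-(1 - \delta)\frac{\|x_i - x_j\|_2^2}{2\sigma^2}} \geq e^{-\frac{\|\Phi x_i - \Phi x_j\|_2^2}{2\sigma^2}} \geq e^{-(1 + \delta)\frac{\|x_i - x_j\|_2^2}{2\sigma^2}},$$

which can be rewritten as 

$$W_{ij}\, e^{\delta \frac{\|x_i - x_j\|_2^2}{2\sigma^2}} \geq \tilde{W}_{ij} \geq W_{ij}\, e^{-\delta \frac{\|x_i - x_j\|_2^2}{2\sigma^2}}. \tag{12}$$

Hence, letting $C = \max_{i,j} \left\{ \frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right\}$ and summing over $j$, 

$$D_{ii}\, e^{\delta C} \geq \tilde{D}_{ii} \geq D_{ii}\, e^{-\delta C}.$$

Inverting and taking the square root gives 

$$\sqrt{D_{ii}^{-1}}\, e^{\delta C / 2} \geq \sqrt{\tilde{D}_{ii}^{-1}} \geq \sqrt{D_{ii}^{-1}}\, e^{-\delta C / 2}. \tag{13}$$

Dividing (12) by $\sqrt{\tilde{D}_{ii}} \sqrt{\tilde{D}_{jj}}$, 

$$\frac{W_{ij}\, e^{\delta C}}{\sqrt{\tilde{D}_{ii}} \sqrt{\tilde{D}_{jj}}} \geq \tilde{A}_{ij} \geq \frac{W_{ij}\, e^{-\delta C}}{\sqrt{\tilde{D}_{ii}} \sqrt{\tilde{D}_{jj}}},$$

and using inequality (13) gives 

$$A_{ij}\, e^{2\delta C} \geq \tilde{A}_{ij} \geq A_{ij}\, e^{-2\delta C}.$$

Subtracting $A_{ij}$ we have 

$$A_{ij}(e^{2\delta C} - 1) \geq \tilde{A}_{ij} - A_{ij} \geq A_{ij}(e^{-2\delta C} - 1).$$

Since $e^{2\delta C} - 1 \geq 1 - e^{-2\delta C}$ for all $\delta > 0$, and $A_{ij} \leq 1$, 

$$|\tilde{A}_{ij} - A_{ij}| \leq e^{2\delta C} - 1. \tag{14}$$

To bound (14) by $\epsilon$ requires $\delta$ to satisfy 

$$\delta \leq \frac{\log(\epsilon + 1)}{2C}.$$

By simple calculation it can be shown that $\log(\epsilon + 1) \geq \frac{1}{2}\epsilon$ for $0 < \epsilon < 1$. Thus, we need $\delta = \frac{\epsilon}{4C}$ so that 
$|\tilde{A}_{ij} - A_{ij}| \leq \epsilon$ holds for all $i, j$. □ 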
Corollary 9: Let $ be a m x n Gaussian matrix and let < e < 1, 6 = r L _ w ira -i ■ Then with high probability, 

: , [ / : ' | 

satisfies RIP with parameters (r,S) provided that the number of measurements, 

T TL 

m = 0(— log— ). 

Proof: Assume $x_i$ is $s$-sparse; then $x_i - x_j$ is $2s$-sparse. By the Gaussian measurement matrix theorem in [19], 
the number of measurements required for $\Phi$ to satisfy the RIP with parameters $(2s, \delta)$ with high probability is 
m > log ™ . For S = 1 p- this gives, 

□ 

The entries of A can be made arbitrarily close to the entries of A when taking m = 0(-% log -£-) random Gaussian 
measurements. Additional bounds on the number of measurements can be found when using matrices with 
random Bernoulli entries or random rows of the Fourier transform [19]. Random Gaussian matrices are used here 
to demonstrate the order of the number of measurements required to achieve our bounds. 

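Corollary 10: Let $A$ and $\tilde{A}$ be defined as in Theorem 8. Then 

$$\|A - \tilde{A}\|_F \leq N\epsilon.$$

Proof: 

$$\|A - \tilde{A}\|_F = \sqrt{\sum_{i,j=1}^{N} |A_{ij} - \tilde{A}_{ij}|^2} \leq \sqrt{\sum_{i,j=1}^{N} \epsilon^2} = \sqrt{N^2 \epsilon^2} = N\epsilon.$$

□ 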
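
The passage from an entrywise bound to the Frobenius bound $N\epsilon$ can be illustrated numerically. The sketch below is a toy experiment under stated assumptions: the helper `affinity`, the synthetic sparse data, the bandwidth `sigma = 2.0`, and the $1/\sqrt{m}$ scaling of the Gaussian matrix are all choices made for this example and are not taken from the numerical section of the paper.

```python
import numpy as np

def affinity(Y, sigma):
    """Normalized affinity D^{-1/2} W D^{-1/2} with Gaussian kernel weights."""
    sq = np.sum(Y**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma**2))
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(4)
N, n, m, s, sigma = 40, 500, 250, 5, 2.0

# Synthetic s-sparse data points (illustrative only).
X = np.zeros((N, n))
for i in range(N):
    X[i, rng.choice(n, size=s, replace=False)] = rng.normal(size=s)

# Gaussian measurements; the 1/sqrt(m) scaling is an assumption of this sketch.
Phi = rng.normal(size=(m, n)) / np.sqrt(m)

A = affinity(X, sigma)
A_tilde = affinity(X @ Phi.T, sigma)

eps_hat = np.max(np.abs(A - A_tilde))        # observed entrywise error
fro = np.linalg.norm(A - A_tilde, "fro")     # Frobenius error, at most N * eps_hat
```

Here `eps_hat` plays the role of $\epsilon$: whatever its value, the Frobenius norm of the difference cannot exceed `N * eps_hat`, and increasing `m` drives `eps_hat` down.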

Theorem 11 illustrates the robustness of the spectral clustering coordinates under small perturbations from taking 
compressed measurements. 

Theorem 11: Let $A$ and $\tilde{A}$ be defined as in Theorem 8, with $v(i)$ and $\tilde{v}(i)$ defined as in Corollary 6. Given there is an 
$\alpha > 0$ such that $\lambda_k - \lambda_{k+1} \geq \alpha$ and $\lambda_k \geq \alpha$, if the number of measurements 

T 77 

m = 0(— log-r-), 
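then with high probability, 

$$\|\tilde{v}(i) - v(i) Q\|_2 \leq (1 + \sqrt{2}) \frac{N\epsilon}{\alpha}.$$

Proof: Assume there exists an $\alpha > 0$ such that $\lambda_k - \lambda_{k+1} \geq \alpha$ and $\lambda_k \geq \alpha$. Given an $\epsilon > 0$, $|A_{ij} - \tilde{A}_{ij}| \leq \epsilon$ by 
Theorem 8. This gives 

$$\|A - \tilde{A}\|_F = \sqrt{\sum_{i,j} |A_{ij} - \tilde{A}_{ij}|^2} \leq \sqrt{\sum_{i,j} \epsilon^2} = \sqrt{N^2 \epsilon^2} = N\epsilon.$$

With the eigengap $\lambda_k - \lambda_{k+1} \geq \alpha$ and the block decomposition of $V$ and $\tilde{V}$, Theorem 5 bounds the column spaces 
of $\{v_1, v_2, \ldots, v_k\}$ and $\{\tilde{v}_1, \tilde{v}_2, \ldots, \tilde{v}_k\}$ by 

$$\|\tilde{V}_k - V_k Q\|_2 \leq (1 + \sqrt{2}) \frac{1}{\alpha} \|A - \tilde{A}\|_F.$$

The spectral clustering coordinates 

$$\|\tilde{v}(i) - v(i) Q\|_2 \leq (1 + \sqrt{2}) \frac{N\epsilon}{\alpha}$$

can then be bounded by Corollary 6. Combining these gives the result. □ 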

5.3 Perturbation of the entries of A from missing measurements 

The size and complexity of data grows exponentially with advancing technology. Often data can be lost, noisy or 
corrupted, or acquiring data can be costly, as in medical MRI acquisition. Clustering algorithms cannot 
be applied directly to incomplete data. Given a fraction of the entries of a matrix, one wishes to recover the missing 
entries under the constraint that the unknown matrix is low rank. This non-convex low rank minimization can be 
solved using nuclear-norm convex relaxation. Matrix completion is the task of recovering an unknown matrix from 
a small subset of its entries. This is possible for low rank matrices under some constraints on the matrix, known as 
the strong incoherence property, with $O(\mathrm{rank}(X) \times N \log^2 N)$ samples via nuclear-norm convex optimization [6]. 

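Definition 12: A matrix $X$ obeys the strong incoherence property with parameter $\mu$ if the following hold.
1) Let $P_Z$ (resp. $P_Y$) be the orthogonal projection onto the singular vectors $z_1, \dots, z_r$ (resp. $y_1, \dots, y_r$) of $X \in \mathbb{R}^{N \times n}$
of rank $r$. For all pairs $(a, a') \in [N] \times [N]$ and $(b, b') \in [n] \times [n]$,

$$\left| \langle e_a, P_Z e_{a'} \rangle - \frac{r}{N} 1_{a=a'} \right| \le \mu \frac{\sqrt{r}}{N}, \qquad \left| \langle e_b, P_Y e_{b'} \rangle - \frac{r}{n} 1_{b=b'} \right| \le \mu \frac{\sqrt{r}}{n}.$$

2) Let $E$ be the "sign matrix" defined by

$$E = \sum_{i \in [r]} z_i y_i^{*}.$$

For all $(a, b) \in [N] \times [n]$,

$$|E_{ab}| \le \mu \frac{\sqrt{r}}{\sqrt{Nn}}.$$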
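As an illustration of Definition 12, the smallest parameter $\mu$ for which the three inequalities hold can be computed directly from a singular value decomposition. The sketch below is a hedged example: the function name `strong_incoherence_parameter` and the two test matrices are hypothetical, and the rank $r$ is passed in by the caller rather than estimated.

```python
import numpy as np

def strong_incoherence_parameter(X, r):
    """Smallest mu satisfying the inequalities of Definition 12 for rank r."""
    N, n = X.shape
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    Z, Y = U[:, :r], Vt[:r, :].T          # left and right singular vectors
    P_Z, P_Y = Z @ Z.T, Y @ Y.T           # orthogonal projections
    E = Z @ Y.T                           # "sign matrix"
    mu_left = np.max(np.abs(P_Z - (r / N) * np.eye(N))) * N / np.sqrt(r)
    mu_right = np.max(np.abs(P_Y - (r / n) * np.eye(n))) * n / np.sqrt(r)
    mu_sign = np.max(np.abs(E)) * np.sqrt(N * n) / np.sqrt(r)
    return max(mu_left, mu_right, mu_sign)

# A "flat" rank-one matrix: its singular vectors are spread over all coordinates.
mu_flat = strong_incoherence_parameter(np.ones((4, 6)), r=1)

# A "spiky" rank-one matrix: all of its mass sits in a single entry.
spiky = np.zeros((4, 4))
spiky[0, 0] = 1.0
mu_spiky = strong_incoherence_parameter(spiky, r=1)

print(mu_flat, mu_spiky)  # 1.0 and 4.0, up to rounding
```

The flat matrix attains $\mu = 1$, while the spiky matrix needs $\mu = 4$, which reflects the intuition that matrices concentrated on a few entries are hard to complete from random samples.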



We will make use of the following theorem, which is a reformulation of Theorem 7 rigorously proved in [20].

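Theorem 13: [20] Let $X \in \mathbb{R}^{N \times n}$ be a fixed matrix of rank $r$ obeying the strong incoherence property with
parameter $\mu$. Suppose we observe a fraction $p = (\text{\# of entries observed})/(Nn)$ of entries of $X$ with locations $\Omega$
sampled uniformly at random with noise $\|P_\Omega(X) - P_\Omega(\tilde{X})\|_F < \delta$, where $\tilde{X}$ denotes the noisy observations and $P_\Omega$ the restriction to the observed entries. Then with high probability the solution $\hat{X}$ to the
matrix completion problem satisfies

$$\|X - \hat{X}\|_F \le 4\sqrt{\frac{(2 + p)\min(N, n)}{p}}\,\delta + 2\delta.$$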



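This provides a bound on the recovery error from matrix completion with noisy observations. Using matrix
completion with $O(r \times N \log^2 N)$ samples to define a local distance, we show that $\tilde{A}$ can be made arbitrarily close
to $A$.

Theorem 14: Let $W_{i,j} = e^{-\|x_i - x_j\|_2^2 / (2\sigma^2)}$ and $A = D^{-1/2} W D^{-1/2}$ where $D_{i,i} = \sum_{k=1}^{N} W(x_i, x_k)$. And let
$\tilde{W}_{i,j} = e^{-\|\hat{x}_i - \hat{x}_j\|_2^2 / (2\sigma^2)}$ and $\tilde{A} = \tilde{D}^{-1/2} \tilde{W} \tilde{D}^{-1/2}$ where $\tilde{D}_{i,i} = \sum_{k=1}^{N} \tilde{W}(\hat{x}_i, \hat{x}_k)$, with $\hat{x}_i$ the rows of the completed matrix $\hat{X}$. Assume the data matrix $X$ obeys the strong
incoherence property with parameter $\mu$ under the same assumptions of Theorem 13. If

$$4\sqrt{\frac{(2 + p)\min(N, n)}{p}}\,\delta + 2\delta \le \frac{\varepsilon}{16 \max_{i,j} \left\{ \frac{\|x_i - x_j\|_2}{2\sigma^2} \right\}},$$

then with high probability

$$|A_{ij} - \tilde{A}_{ij}| < \varepsilon.$$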
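The recovery bound of Theorem 13 and the threshold in Theorem 14 are simple closed-form expressions, so a short worked example may help. The sketch assumes the bound reads $4\sqrt{(2+p)\min(N,n)/p}\,\delta + 2\delta$ and the threshold reads $\varepsilon / \bigl(16 \max_{i,j} \|x_i - x_j\|_2 / (2\sigma^2)\bigr)$; the function names and the numerical inputs are hypothetical.

```python
import math

def completion_error_bound(p, N, n, delta):
    """Right-hand side of the recovery bound in Theorem 13."""
    return 4.0 * math.sqrt((2.0 + p) * min(N, n) / p) * delta + 2.0 * delta

def max_recovery_error(eps, max_dist, sigma):
    """Largest recovery error allowed by the condition of Theorem 14."""
    return eps / (16.0 * max_dist / (2.0 * sigma**2))

# Half of the entries of a 20 x 50 matrix observed, noise level delta = 0.5.
gamma = completion_error_bound(p=0.5, N=20, n=50, delta=0.5)
print(gamma)      # 21.0

# Target entrywise accuracy eps = 0.1 with max pairwise distance 2 and sigma = 1.
threshold = max_recovery_error(eps=0.1, max_dist=2.0, sigma=1.0)
print(threshold)  # 0.00625, up to rounding
```

In this toy setting the recovery bound (21.0) is far above the threshold (0.00625), so the noise level $\delta$ would have to be much smaller before the condition of Theorem 14 is met.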



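Proof: From the performance guarantees for the matrix completion problem with noise proved in Theorem 7 of [20] we
have,

$$\|X - \hat{X}\|_F \le 4\sqrt{\frac{(2 + p)\min(N, n)}{p}}\,\delta + 2\delta \overset{\text{def}}{=} \gamma. \qquad (15)$$

Writing $x_i$ and $\hat{x}_i$ for the rows of $X$ and $\hat{X}$,

$$\|X - \hat{X}\|_F^2 = \sum_{i=1}^{N} \|x_i - \hat{x}_i\|_2^2 \le \gamma^2$$

implies

$$\|x_i - \hat{x}_i\|_2 \le \gamma.$$

Applying the triangle inequality gives,

$$\|\hat{x}_i - \hat{x}_j\|_2 \le \|\hat{x}_i - x_i\|_2 + \|x_i - x_j\|_2 + \|x_j - \hat{x}_j\|_2 \le \|x_i - x_j\|_2 + 2\gamma.$$

Similarly,

$$\|x_i - x_j\|_2 \le \|\hat{x}_i - \hat{x}_j\|_2 + 2\gamma.$$

Combining these gives,

$$\|x_i - x_j\|_2 - 2\gamma \le \|\hat{x}_i - \hat{x}_j\|_2 \le \|x_i - x_j\|_2 + 2\gamma. \qquad (16)$$

Squaring gives,

$$\|x_i - x_j\|_2^2 - 4\gamma\|x_i - x_j\|_2 + 4\gamma^2 \le \|\hat{x}_i - \hat{x}_j\|_2^2 \le \|x_i - x_j\|_2^2 + 4\gamma\|x_i - x_j\|_2 + 4\gamma^2.$$

Multiplying by $-\frac{1}{2\sigma^2}$ and exponentiating gives,

$$e^{\frac{4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}}\, e^{\frac{-\|x_i - x_j\|_2^2}{2\sigma^2}} \ge e^{\frac{-\|\hat{x}_i - \hat{x}_j\|_2^2}{2\sigma^2}} \ge e^{\frac{-4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}}\, e^{\frac{-\|x_i - x_j\|_2^2}{2\sigma^2}},$$

which can be rewritten as,

$$W_{ij}\, e^{\frac{4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}} \ge \tilde{W}_{ij} \ge W_{ij}\, e^{\frac{-4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}}. \qquad (17)$$

Hence,

$$\sum_j W_{ij}\, e^{\frac{4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}} \ge \sum_j \tilde{W}_{ij} \ge \sum_j W_{ij}\, e^{\frac{-4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}}.$$

Letting $C = 4 \max_{i,j} \left\{ \frac{\|x_i - x_j\|_2}{2\sigma^2} \right\}$,
and taking the square root gives,

$$\sqrt{D_{i,i}}\; e^{\frac{1}{2}\left(\gamma C - \frac{4\gamma^2}{2\sigma^2}\right)} \ge \sqrt{\tilde{D}_{i,i}} \ge \sqrt{D_{i,i}}\; e^{\frac{1}{2}\left(-\gamma C - \frac{4\gamma^2}{2\sigma^2}\right)}. \qquad (18)$$

Dividing (17) by $\sqrt{\tilde{D}_{i,i}} \sqrt{\tilde{D}_{j,j}}$,

$$\frac{W_{ij}}{\sqrt{\tilde{D}_{i,i}} \sqrt{\tilde{D}_{j,j}}}\, e^{\frac{4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}} \ge \tilde{A}_{ij} \ge \frac{W_{ij}}{\sqrt{\tilde{D}_{i,i}} \sqrt{\tilde{D}_{j,j}}}\, e^{\frac{-4\gamma\|x_i - x_j\|_2 - 4\gamma^2}{2\sigma^2}},$$

and by using inequality (18) gives,

$$A_{ij}\, e^{2\gamma C} \ge \tilde{A}_{ij} \ge A_{ij}\, e^{-2\gamma C}.$$

Subtracting $A_{ij}$ we have,

$$A_{ij}\left(e^{2\gamma C} - 1\right) \ge \tilde{A}_{ij} - A_{ij} \ge A_{ij}\left(e^{-2\gamma C} - 1\right).$$

Since $e^{2\gamma C} - 1 > 1 - e^{-2\gamma C}$ for all $\gamma > 0$, and $A_{ij} = \frac{W_{ij}}{D_{i,i}^{1/2} D_{j,j}^{1/2}} \le 1$,

$$|A_{ij} - \tilde{A}_{ij}| < e^{2\gamma C} - 1. \qquad (19)$$

To bound (19) by $\varepsilon$ requires $\gamma$ to satisfy,

$$\gamma \le \frac{\log(\varepsilon + 1)}{2C}.$$

By simple calculation it can be shown that $\log(1 + \varepsilon) > \frac{\varepsilon}{2}$ for $0 < \varepsilon < 1$. Thus, we need $\gamma \le \frac{\varepsilon}{4C}$, so that
$|A_{ij} - \tilde{A}_{ij}| < \varepsilon$ holds for all $i, j$. □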
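The chain of inequalities in this proof can be followed numerically. The sketch below is a hedged illustration: a small random perturbation stands in for the output of matrix completion, and the data, the perturbation size and `sigma` are hypothetical. It checks the distance inequality (16) and compares the observed entrywise error with the right-hand side of (19).

```python
import numpy as np

def pairwise_dist(P):
    """Euclidean distances between the rows of P."""
    sq = np.sum(P**2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * P @ P.T, 0.0))

def normalized_affinity(dist, sigma):
    """A = D^{-1/2} W D^{-1/2} with Gaussian kernel W built from distances."""
    W = np.exp(-dist**2 / (2.0 * sigma**2))
    d = W.sum(axis=1)
    return W / np.sqrt(np.outer(d, d))

rng = np.random.default_rng(1)
N, n, sigma = 50, 20, 1.0

# Hypothetical data; X_hat is a stand-in for a matrix completion output.
X = 0.3 * rng.standard_normal((N, n))
X_hat = X + 1e-3 * rng.standard_normal((N, n))

# gamma plays the role of the recovery error in (15).
gamma = np.linalg.norm(X - X_hat, "fro")

dist, dist_hat = pairwise_dist(X), pairwise_dist(X_hat)

# Inequality (16): pairwise distances change by at most 2 * gamma.
gap16 = np.max(np.abs(dist_hat - dist))

# Constant C from the proof and the right-hand side of (19).
C = 4.0 * dist.max() / (2.0 * sigma**2)
bound19 = np.exp(2.0 * gamma * C) - 1.0

A = normalized_affinity(dist, sigma)
A_hat = normalized_affinity(dist_hat, sigma)
entry_err = np.max(np.abs(A - A_hat))

print(f"max distance change={gap16:.3e}  2*gamma={2 * gamma:.3e}")
print(f"max entry error={entry_err:.3e}  exp(2*gamma*C)-1={bound19:.3e}")
```

In this example the Frobenius-norm quantity `gamma` is much larger than any single row perturbation, which is one reason the observed entrywise error sits well below the bound.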
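Theorem 15: Let $A$ and $\tilde{A}$ be defined as in Theorem 14 with $v(i)$ and $\tilde{v}(i)$ defined in Corollary 6. Given there is
an $\alpha > 0$ such that $\lambda_k - \lambda_{k+1} > \alpha$ and $\lambda_k > \alpha$, if

$$4\sqrt{\frac{(2 + p)\min(N, n)}{p}}\,\delta + 2\delta \le \frac{\varepsilon}{16 \max_{i,j} \left\{ \frac{\|x_i - x_j\|_2}{2\sigma^2} \right\}},$$

then with high probability

$$\|v(i) - \tilde{v}(i)Q\|_2 < (1 + \sqrt{2})\,\frac{N\varepsilon}{\alpha}.$$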



Proof: The proof is similar to that of Theorem 11, combining Theorem 14 and Corollary 6. □

6 Numerical Results 

The experiments in this section were performed on three grayscale image sets: a synthetic set, a set of face images
and a set of handwritten digits. The spectral clustering method requires local distances between data points. Each
data point is an image, and the distances between images, $d(x_i, x_j)$, are as defined above using the Frobenius norm
distance [21] and compressed sensing measurements.

6.1 Synthetic - Advertisers clustered by keywords 

This first experiment clusters advertisers by keywords for a search engine query. Each advertiser $x_i$
pays for at most r keywords that they want connected to their product, making $x_i \in \mathbb{R}^n$ s-sparse in the n-keyword
feature space. The goal is to group the advertisers so that advertisers within the same group pay for
similar keywords, in order to identify a wider range of targeted keywords for advertisers. Spectral clustering provides a
balance of cluster compactness, conductance and proportional size. Here synthetic feature vectors are created from
an unknown unitary sparse transform with a varying sparsity level. Figure 5 shows the clustering of a data set with
two hidden data clouds. The left figure shows clustering using only 10 random measurements and the middle using
30, compared to standard techniques using the full image on the right. Both methods provide a linearly separable
dimension reduction, but our compressive method uses only three measurements rather than the full
dimension. The number of measurements used here is much lower than our theoretical requirement, yet it
performs at the same perfect classification rate.
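A small synthetic experiment in the same spirit can be sketched as follows. This is a hedged illustration, not the code behind Figure 5: the data are taken to be sparse in the canonical basis rather than under an unknown unitary transform, the measurement matrix `Phi` is assumed to be i.i.d. Gaussian, the kernel width is fixed at 1, and the two groups are read off from the sign of the second eigenvector instead of a separate clustering step.

```python
import numpy as np

def second_eigenvector_labels(points, sigma):
    """Split points into two groups by the sign of the second eigenvector of A."""
    sq = np.sum(points**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * points @ points.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma**2))
    d = W.sum(axis=1)
    A = W / np.sqrt(np.outer(d, d))
    _, V = np.linalg.eigh(A)          # eigenvalues in ascending order
    return (V[:, -2] > 0).astype(int)

def accuracy(pred, truth):
    """Classification rate, allowing for the two labels being swapped."""
    acc = np.mean(pred == truth)
    return max(acc, 1.0 - acc)

rng = np.random.default_rng(0)
n, s, N, m = 100, 3, 200, 30          # dimension, sparsity, objects, measurements

# Two groups of s-sparse vectors with disjoint supports (hypothetical data).
truth = np.repeat([0, 1], N // 2)
X = np.zeros((N, n))
X[truth == 0, 0:s] = 1.0 + 0.1 * rng.standard_normal((N // 2, s))
X[truth == 1, s:2 * s] = 1.0 + 0.1 * rng.standard_normal((N // 2, s))

# Assumed measurement matrix: i.i.d. Gaussian entries scaled by 1/sqrt(m).
Phi = rng.standard_normal((m, n)) / np.sqrt(m)

acc_full = accuracy(second_eigenvector_labels(X, 1.0), truth)
acc_comp = accuracy(second_eigenvector_labels(X @ Phi.T, 1.0), truth)
print(f"full signal: {acc_full:.2f}   {m} measurements: {acc_comp:.2f}")
```

Because the two groups are well separated, both the full and the compressed affinity matrices keep a clear two-block structure, so the second eigenvector changes sign between the groups in either case.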
[Figure 5: three panels, each titled "Eigenfunctions with dim=100 spar=3 gps=2 objs=200"]




Fig. 5. Data set of synthetic 100-dimensional 3-sparse image vectors with two underlying clusters; each image is projected onto the first three
nontrivial eigenvectors. Left: Using the full 100-dimensional image vector (standard spectral clustering). Center: Using 3 compressed sensing
measurements of each image. Right: Using 30 compressed sensing measurements.




Fig. 6. 50 - 220 x 220 images of two different faces in a range of poses from profile to frontal views. Far Left: Unordered. Left Center: Ordered by 
the second eigenvector from standard spectral clustering. Right Center: By compressive spectral clustering using 100 measurements. Far Right: 
Using 1000 measurements. 



6.2 Face Database 

This experiment uses a database of 36 images of 112 x 92 pixels of the same person rotating his head, from the UMIST
Face Database [22]. It is well known that images have sparse Fourier and wavelet transforms, a fact used in JPEG
and JPEG2000 image compression, but the optimal sparsifying transform here is unknown. The ideal transform
would capture only the desired differences in a given set. Here it would transform the image vector into a signal of
sparsity level proportional to the degrees of freedom of rotation and/or the number of subjects used. In Figure 6 the
ambient dimension is 10,000 and only 10 measurements are used. The second eigenvectors of $A$ and $\tilde{A}$ both capture
the underlying rotation from their shuffled order, but our method uses 1/1000 the number of measurements.

Often a signal may not be obtainable, but random measurements of the signal might be. For example, one may observe
only the inner products of an image with random rows of the Fourier transform. We show that with m
random measurements, a set of images can be clustered by their measurements in the same way as if the clustering
were performed on the full images themselves.



[Figure 7 plot: "Classification of three face images"; curves: Compressed, Full Signal; horizontal axis: $2^x$ Gaussian measurements]



Fig. 7. A dataset of 100 images of three different people's faces in a range of poses from profile to frontal views. The images are classified by
applying k-means to the compressive spectral clustering coordinates (2nd and 3rd eigenvectors), using a range of measurements. The average
misclassification rates over 100 trials are plotted against the average misclassification rate resulting from standard spectral clustering applied to the
full images.

In comparing our compressed results with standard spectral clustering, we achieve the same classification rate
with fewer measurements than the full dimension of the images. A misclassification rate of 0.1 is achieved using only
$2^8$ measurements, whereas the full $2^{13}$-dimensional signal is required in standard spectral clustering, as seen in
Figure 7. Here the clustering was performed on a set of images of three different people. There is more than one
way to cluster a set of more than two faces. In addition to grouping by person A, B and C, the images could be
grouped by skin tone, hair color or male/female, all of which present a valid classification. Our method is only
guaranteed to perform as well as standard spectral clustering. Because spectral clustering is unsupervised, given no
labeled examples as inputs, the natural groupings found by spectral clustering may fail to match a desired labeling,
causing a higher misclassification rate.

6.3 Handwritten Digits 

The sparsifying transform here is again unknown. The ideal transform would transform the image vector into a
signal of sparsity level proportional to the number of digits {0, 1, ..., 9} to be classified. Figures 4 and 8 show the
clustering of pairs of handwritten digits, using standard spectral clustering on the left and compressive spectral
clustering on the right. The structure of the column space of $\{v_1, v_2, v_3, v_4\}$ is maintained under the perturbation
from making compressed sensing measurements. The clusters defined by the first eigenvectors of $A$, built from
the true Euclidean distances, are equal to the clusters defined by the first eigenvectors of $\tilde{A}$, built from compressed
measurements.

6.4 Synthetic Images with missing entries 

Here we use a data set of 1000 synthetic images of three different classes where we can control the rank of the data
matrix. In both experiments only 10% of the entries are observed. The first experiment varies the approximate rank
of the data where the images are stacked as row vectors. As the rank of the matrix is artificially increased, the number
of underlying clusters is blurred. Figure 10 shows that as the approximate rank increases so does the misclassification rate.
The second experiment uses a combination of the two perturbation errors, from matrix completion and compressed
measurements. Figure 10 shows that as the number of measurements increases, the span of the compressed spectral
coordinates with matrix completion is close to the traditional spectral coordinates.

Fig. 8. The handwritten digits {1, 3, 7} data set is projected onto the 2nd and 3rd eigenvectors of the graph formed Left: using Euclidean distances
between points, Middle: using distances from 32 Gaussian measurements, Right: using distances from 128 random Gaussian measurements.



[Figure 9 plot: "M=L+E where M is approximately rank 3"]

Fig. 9. M is a rank 3 matrix with 1000 images (three clusters) of dimension 500 stacked as rows with only 10% of the entries observed. The
images are classified by applying matrix completion and then using the first three spectral clustering coordinates (1st, 2nd and 3rd eigenvectors).
The number of clusters is perturbed by increasing the rank of M.